9 - Rechnerarchitektur [ID:10877]

And the state of this point here results from its own state and from that of its neighbors.

This is a classic stencil operation.

Yes, and this is now fleshed out accordingly.

And if the point has the coordinates i, j, then the new value here results, for example, from the temperature at the point (i, j) at time t.

That is, at time t plus 1, the temperature at the point (i, j) is the temperature at (i, j) at time t, plus the neighbors, which are all weighted by the factor one quarter, 1, 2, 3, 4.

That refers to time t, and then we have each of the four neighbors in a right-angled neighborhood.

So, exactly. That, for example, would be one such stencil.
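Written out once as a formula (a reconstruction from the spoken description above, with the four neighbors weighted by one quarter as stated):

    T^{t+1}_{i,j} = T^{t}_{i,j} + \tfrac{1}{4}\,\bigl(T^{t}_{i-1,j} + T^{t}_{i+1,j} + T^{t}_{i,j-1} + T^{t}_{i,j+1}\bigr)

This is exactly what yields the 4 additions and the one multiplication by 0.25 that are counted next.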

Yes, exactly, and now I'm counting here: these are all floating-point operations.

The factor 0.25 is a special case, of course; a quarter I could also produce with two shifts to the right, but if I really compute it as a floating-point operation,

then I have 1, 2, 3, 4, 5 floating-point operations.

And I have to fetch all the data together once: 1, 2, 3, 4, 5 values.

Let's say these are all float variables, 4 bytes each; then I have 20 bytes,

20 bytes for 5 float variables.

And how many floating-point operations do I perform? 5.

1, 2, 3, 4 additions, plus the multiplication, which I now really compute and don't do with a shift;

that makes 5 floating-point operations, so 5 over 20, and that's my arithmetic intensity.

Wait a minute, stop, let me get the ratio the right way around.

I'm calculating the flops per byte, so the flops go on top: I have exactly 5 flops and 20 bytes,

so that's a quarter of a flop per byte.
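Written out with the numbers from above (5 floating-point operations, 5 float values of 4 bytes each):

    I = \frac{5\ \text{flops}}{5 \times 4\ \text{bytes}} = 0.25\ \text{flops/byte}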

And accordingly it lands on the x-axis here; we can't really see it any more, but here it would be at a quarter.

And that is now the arithmetic intensity; I can derive it purely from the arithmetic,

from the description of my algorithm, from the description of my kernel.

And this would normally run in a loop, a loop where I iterate over t

and compute the successive states at the different points in time.

By the way, I should just say that this is representative of one point here in my grid; the same operations run on all grid points at the same time.

That makes this problem nicely parallelizable.
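A minimal sketch of how this could look in code, just to make the loop structure concrete (grid size, step count, names and the single hot point as initial condition are assumptions for illustration, not taken from the lecture); every interior grid point is updated independently, which is why a single OpenMP pragma is enough to parallelize a sweep:

    #define N     512    /* assumed grid size (not from the lecture)     */
    #define STEPS 100    /* assumed number of time steps                 */

    /* One sweep: T_new(i,j) = T(i,j) + 0.25 * (sum of the four          */
    /* neighbors), i.e. the 4 additions and 1 multiplication from above. */
    static void sweep(float T[N][N], float T_new[N][N])
    {
        #pragma omp parallel for        /* all grid points are independent */
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                T_new[i][j] = T[i][j] + 0.25f * (T[i - 1][j] + T[i + 1][j]
                                               + T[i][j - 1] + T[i][j + 1]);
    }

    /* The outer loop over t: two buffers, alternated every time step. */
    static void run(float A[N][N], float B[N][N])
    {
        for (int t = 0; t < STEPS; t += 2) {
            sweep(A, B);   /* state at time t+1 from time t   */
            sweep(B, A);   /* state at time t+2 from time t+1 */
        }
    }

    static float A[N][N], B[N][N];      /* zero-initialized grids */

    int main(void)
    {
        A[N / 2][N / 2] = 100.0f;       /* a single hot point to diffuse */
        run(A, B);
        return 0;
    }

Two buffers are used so that a sweep always reads values from the old time step only; updating in place would mix old and new values within one step.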

Okay, that's clear.

Good.

And, oh yes, right, that's the arithmetic intensity.

I can get it from the description of my algorithm,

under the assumption that an access here costs me only one clock cycle.

The data are already there, so to speak; they are already in the cache.

That is usually not the case; in general it will be a little worse,

because I first have to fetch the data from memory.

And that is the operational intensity.

That means, if I look at the operational intensity, then the 20 bytes per access here

won't quite work out, because the data first have to be fetched from memory.

So it will be a little less.

Not 20 bytes per access, but a little less, because the latency of the DRAM comes into play again.

And that is taken into account there, and I have to measure it.

I can't determine it just by looking at my algorithm.

So that would be the operational intensity and this is exactly what you can see here.

So, here I am again.

No, it can't be seen here, but later on, in the other graph,

exactly there the arithmetic intensity can be seen.
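For context, the graph being pointed at shows the standard roofline bound (the formula is not spelled out at this point in the recording; the symbols P_peak and b_mem are used here only for illustration): the attainable performance P of a kernel with operational intensity I is limited by

    P = \min\bigl(P_{\text{peak}},\; I \cdot b_{\text{mem}}\bigr)

so a kernel with I = 0.25 flops/byte sits far to the left on the x-axis, typically under the slanted, bandwidth-limited part of the roof.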

So, yes, to sum it up, this example here is taken from the book by Patterson, Computer Organization and Design.

Part of a video series

Accessible via: Open access

Duration: 01:20:43 min

Recording date: 2017-12-18

Uploaded on: 2019-05-01 10:39:03

Language: de-DE

The lecture builds on the content taught in Fundamentals of Computer Architecture and Organization and continues it with more advanced topics. It first covers fundamental advanced techniques for pipeline processing and cache accesses in modern processors and parallel computers. Furthermore, the architecture of special-purpose processors, e.g. DSPs and embedded processors, is covered. It is shown how these techniques are used in concrete architectures (Intel Nehalem, GPGPU, Cell BE, TMS320 DSP, the ZPU embedded processor). The lecture is accompanied by a blackboard exercise and a computer exercise; through successful participation in these, 5 or 7.5 ECTS, respectively, can be earned together with the lecture. In the blackboard exercises, the techniques taught in the lecture are deepened through problems to be solved. In the computer exercise, among other things, a simple many-core processor based on the ZPU processor is to be built using simulation tools. In detail, the following topics are covered:
  • Organizational aspects of CISC and RISC processors

  • Handling of hazards in pipelines

  • Advanced techniques of dynamic branch prediction

  • Advanced cache techniques, cache coherence

  • Exploiting cache effects

  • Architectures of digital signal processors

  • Architectures of homogeneous and heterogeneous multicore processors (Intel Core i7, Nvidia GPUs, Cell BE)

  • Architecture of parallel computers (cluster computers, supercomputers)

  • Efficient low-level programming of multicore processors (OpenMP, SSE, CUDA, OpenCL)

  • Performance modeling and analysis of multicore processors (roofline model)

Recommended literature
  • Patterson/Hennessy: Computer Organization and Design
  • Hennessy/Patterson: Computer Architecture - A Quantitative Approach

  • Stallings: Computer Organization and Architecture

  • Märtin: Rechnerarchitekturen
